{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Groundedness Evaluation\n",
    "\n",
    "- Author: [Sungchul Kim](https://github.com/rlatjcj)\n",
    "- Peer Review: [Park Jeong-Ki](https://github.com/jeongkpa), [BokyungisaGod](https://github.com/BokyungisaGod)\n",
    "- This is a part of [LangChain-OpenTutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n",
    "\n",
    "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/16-Evaluations/11-Groundedness-Evaluation.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/16-Evaluations/11-Groundedness-Evaluation.ipynb)\n",
    "\n",
    "## Overview\n",
    "\n",
    "**Groundedness Evaluator** an Evaluator that assesses whether answers are accurate based on the given context.\n",
    "This Evaluator can be used to evaluate hallucinations in RAG's responses.\n",
    "In this tutorial, we will look at how to evaluate Groundedness using Upstage Groundedness Checker (`UpstageGroundednessCheck`) and a custom-made Groundedness Checker.\n",
    "\n",
    "\n",
    "### Table of Contents\n",
    "\n",
    "- [Overview](#overview)\n",
    "- [Environment Setup](#environment-setup)\n",
    "- [Set Groundedness Checkers](#set-groundedness-checkers)\n",
    "- [Comprehensive evaluation of dataset using summary evaluators](#comprehensive-evaluation-of-dataset-using-summary-evaluators)\n",
    "\n",
    "### References\n",
    "\n",
    "- [LangSmith Evaluation](https://docs.smith.langchain.com/evaluation)\n",
    "- [Upstage Groundedness Checker](https://console.upstage.ai/docs/capabilities/groundedness-check)\n",
    "----"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "Setting up your environment is the first step. See the [Environment Setup](https://wikidocs.net/257836) guide for more details.\n",
    "\n",
    "\n",
    "**[Note]**\n",
    "\n",
    "The `langchain-opentutorial` is a package of easy-to-use environment setup guidance, useful functions and utilities for tutorials.\n",
    "Check out the  [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture --no-stderr\n",
    "%pip install langchain-opentutorial"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install required packages\n",
    "from langchain_opentutorial import package\n",
    "\n",
    "package.install(\n",
    "    [\n",
    "        \"langsmith\",\n",
    "        \"langchain_core\",\n",
    "        \"langchain_community\",\n",
    "        \"langchain_text_splitters\",\n",
    "        \"langchain_openai\",\n",
    "        \"langchain_upstage\",\n",
    "        \"pymupdf\",\n",
    "        \"faiss-cpu\",\n",
    "    ],\n",
    "    verbose=False,\n",
    "    upgrade=False,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Environment variables have been set successfully.\n"
     ]
    }
   ],
   "source": [
    "# Set environment variables\n",
    "from langchain_opentutorial import set_env\n",
    "\n",
    "set_env(\n",
    "    {\n",
    "        \"OPENAI_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_API_KEY\": \"\",\n",
    "        \"LANGCHAIN_TRACING_V2\": \"true\",\n",
    "        \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n",
    "        \"LANGCHAIN_PROJECT\": \"Groundedness-Evaluations\",\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.\n",
    "\n",
    "[Note] This is not necessary if you've already set the required API keys in previous steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load API keys from .env file\n",
    "from dotenv import load_dotenv\n",
    "\n",
    "load_dotenv(override=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define a function for RAG performance testing\n",
    "\n",
    "Let's create an RAG system that will be used for testing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import Callable\n",
    "\n",
    "from langchain_community.document_loaders import PyMuPDFLoader\n",
    "from langchain_community.vectorstores import FAISS\n",
    "from langchain_core.output_parsers import StrOutputParser\n",
    "from langchain_core.prompts import PromptTemplate\n",
    "from langchain_core.runnables import RunnablePassthrough\n",
    "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
    "from langchain_text_splitters import RecursiveCharacterTextSplitter\n",
    "\n",
    "\n",
    "def ask_question_with_llm(llm: ChatOpenAI, file_path: str) -> Callable[[dict], dict]:\n",
    "    \"\"\"Create a function that answers questions.\n",
    "    \n",
    "    Args:\n",
    "        llm (ChatOpenAI): Language Model\n",
    "        file_path (str): Path to the PDF file\n",
    "    \n",
    "    Returns:\n",
    "        (Callable[[dict], dict]): A function that answers questions\n",
    "    \"\"\"\n",
    "    \n",
    "    # Load documents\n",
    "    loader = PyMuPDFLoader(file_path)\n",
    "    docs = loader.load()\n",
    "    \n",
    "    # Split given documents\n",
    "    text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)\n",
    "    split_documents = text_splitter.split_documents(docs)\n",
    "\n",
    "    # Create a retriever\n",
    "    embeddings = OpenAIEmbeddings(model=\"text-embedding-3-small\")\n",
    "    vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)\n",
    "    retriever = vectorstore.as_retriever()\n",
    "\n",
    "    # Create a chain\n",
    "    prompt = PromptTemplate.from_template(\n",
    "    \"\"\"\n",
    "    You are an assistant for question-answering tasks. \n",
    "    Use the following pieces of retrieved context to answer the question. \n",
    "    If you don't know the answer, just say that you don't know. \n",
    "\n",
    "    #Context: \n",
    "    {context}\n",
    "\n",
    "    #Question:\n",
    "    {question}\n",
    "\n",
    "    #Answer:\n",
    "    \"\"\"\n",
    "    )\n",
    "\n",
    "    rag_chain = (\n",
    "        {\n",
    "            \"context\": retriever,\n",
    "            \"question\": RunnablePassthrough(),\n",
    "        }\n",
    "        | prompt\n",
    "        | llm\n",
    "        | StrOutputParser()\n",
    "    )\n",
    "\n",
    "    def _ask_question(inputs: dict) -> dict:\n",
    "        # Search for context related to the question\n",
    "        context = retriever.invoke(inputs[\"question\"])\n",
    "        # Combine the retrieved documents into a single string\n",
    "        context = \"\\n\".join([doc.page_content for doc in context])\n",
    "        # Return a dictionary containing the question, context, and answer\n",
    "        return {\n",
    "            \"question\": inputs[\"question\"],\n",
    "            \"context\": context,\n",
    "            \"answer\": rag_chain.invoke(inputs[\"question\"]),\n",
    "        }\n",
    "\n",
    "    return _ask_question\n",
    "\n",
    "gpt_chain = ask_question_with_llm(\n",
    "    llm=ChatOpenAI(model=\"gpt-4o-mini\", temperature=0),\n",
    "    file_path=\"data/Newwhitepaper_Agents2.pdf\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set Groundedness Checkers\n",
    "\n",
    "To evaluate groundedness, `UpstageGroundednessCheck` and `Custom Groundedness Checker` will be used."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set `UpstageGroundednessCheck`\n",
    "\n",
    "To use Upstage's Groundedness Checker (`UpstageGroundednessCheck`), you need to obtain an API key from the link below.\n",
    " - [Get API Key](https://console.upstage.ai/api-keys)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_upstage import UpstageGroundednessCheck\n",
    "\n",
    "# Create Upstage Groundedness Checker\n",
    "upstage_groundedness_check = UpstageGroundednessCheck()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "grounded\n"
     ]
    }
   ],
   "source": [
    "# Run Groundedness Checker for evaluation\n",
    "request_input = {\n",
    "    \"context\": \"Teddy's gender is male and he operates the Teddynote YouTube channel.\",\n",
    "    \"answer\": \"Teddy is a male.\",\n",
    "}\n",
    "\n",
    "response = upstage_groundedness_check.invoke(request_input)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "notGrounded\n"
     ]
    }
   ],
   "source": [
    "# Run Groundedness Checker for evaluation\n",
    "request_input = {\n",
    "    \"context\": \"Teddy's gender is male and he operates the Teddynote YouTube channel.\",\n",
    "    \"answer\": \"Teddy is a female.\",\n",
    "}\n",
    "\n",
    "response = upstage_groundedness_check.invoke(request_input)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Define `UpstageGroundednessCheck` Evaluator. It will be used in `evaluate` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langsmith.schemas import Run, Example\n",
    "from langsmith.evaluation import evaluate\n",
    "\n",
    "\n",
    "def upstage_groundedness_check_evaluator(run: Run, example: Example) -> dict:\n",
    "    # Get generated answer and context\n",
    "    answer = run.outputs.get(\"answer\", \"\")\n",
    "    context = run.outputs.get(\"context\", \"\")\n",
    "\n",
    "    # Check groundedness\n",
    "    groundedness_score = upstage_groundedness_check.invoke(\n",
    "        {\"answer\": answer, \"context\": context}\n",
    "    )\n",
    "    groundedness_score = groundedness_score == \"grounded\"\n",
    "\n",
    "    return {\"key\": \"groundedness_score\", \"score\": int(groundedness_score)}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set Custom Groundedness Checker\n",
    "\n",
    "Create a custom Groundedness Checker using OpenAI's model.  \n",
    "For this tutorial, we'll use the target `retrieval-answer`.\n",
    "If you want to use other targets ('question-answer' or 'question-retrieval'), you should change the description in `GroundednessScore` and the prompt template in `GroundednessChecker.create()` accordingly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain_core.runnables import RunnableSequence\n",
    "from langchain_openai import ChatOpenAI\n",
    "from langsmith.schemas import Example, Run\n",
    "from pydantic import BaseModel, Field\n",
    "\n",
    "\n",
    "class GroundednessScore(BaseModel):\n",
    "    \"\"\"Binary scores for relevance checks\"\"\"\n",
    "\n",
    "    score: str = Field(\n",
    "        description=\"relevant or not relevant. Answer 'yes' if the answer is relevant to the retrieved document else answer 'no'\"\n",
    "    )\n",
    "\n",
    "\n",
    "class GroundednessChecker:\n",
    "    \"\"\"This class is used to evaluate the accuracy of a document.\n",
    "    \n",
    "    It returns 'yes' or 'no' as the evaluation result.\n",
    "\n",
    "    Attributes:\n",
    "        llm (ChatOpenAI): Language Model instance.\n",
    "    \"\"\"\n",
    "\n",
    "    def __init__(self, llm: ChatOpenAI) -> None:\n",
    "        self.llm = llm\n",
    "\n",
    "    def create(self) -> RunnableSequence:\n",
    "        \"\"\"\n",
    "        Create a chain for evaluating accuracy.\n",
    "\n",
    "        Returns:\n",
    "            Chain: A chain object that can evaluate accuracy\n",
    "        \"\"\"\n",
    "        llm = self.llm.with_structured_output(GroundednessScore)\n",
    "\n",
    "        # 프롬프트 선택\n",
    "        template = \"\"\"You are a grader assessing relevance of a retrieved document to a user question. \\n \n",
    "            Here is the retrieved document: \\n\\n {context} \\n\\n\n",
    "            Here is the answer: {answer} \\n\n",
    "            If the document contains keyword(s) or semantic meaning related to the user answer, grade it as relevant. \\n\n",
    "            \n",
    "            Give a binary score 'yes' or 'no' score to indicate whether the retrieved document is relevant to the answer.\n",
    "        \"\"\"\n",
    "        input_vars = [\"context\", \"answer\"]\n",
    "\n",
    "        # Create a prompt\n",
    "        prompt = PromptTemplate(\n",
    "            template=template,\n",
    "            input_variables=input_vars,\n",
    "        )\n",
    "\n",
    "        # Create a chain\n",
    "        chain = prompt | llm\n",
    "        return chain\n",
    "\n",
    "# Create a Groundedness Checker\n",
    "custom_groundedness_check = GroundednessChecker(\n",
    "        ChatOpenAI(model=\"gpt-4o-mini\", temperature=0)\n",
    "    ).create()\n",
    "\n",
    "def custom_groundedness_check_evaluator(run: Run, example: Example) -> dict:\n",
    "    # Get generated answer and context\n",
    "    answer = run.outputs.get(\"answer\", \"\")\n",
    "    context = run.outputs.get(\"context\", \"\")\n",
    "\n",
    "    # Groundedness Check\n",
    "    groundedness_score = custom_groundedness_check.invoke(\n",
    "        {\"answer\": answer, \"context\": context}\n",
    "    )\n",
    "    groundedness_score = groundedness_score.score == \"yes\"\n",
    "\n",
    "    return {\"key\": \"groundedness_score\", \"score\": int(groundedness_score)}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate Groundedness using Upstage's and Custom Groundedness Checker\n",
    "\n",
    "Evaluate Groundedness using Upstage's and Custom Groundedness Checker.  \n",
    "Before this, check if the dataset exists created before.\n",
    "If you don't have the dataset, create one referring to [04-LangSmith-Dataset](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/16-Evaluations/04-LangSmith-Dataset.ipynb).  \n",
    "In this tutorial, we'll use custom Q&A dataset referred to [Google Whitepaper on AI Agents\n",
    "](https://media.licdn.com/dms/document/media/v2/D561FAQH8tt1cvunj0w/feedshare-document-pdf-analyzed/B56ZQq.TtsG8AY-/0/1735887787265?e=1736985600&v=beta&t=pLuArcKyUcxE9B1Her1QWfMHF_UxZL9Q-Y0JTDuSn38)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RAG_EVAL_DATASET\n",
      "inputs : What are the three targeted learnings to enhance model performance?\n",
      "outputs : The three targeted learning approaches to enhance model performance mentioned in the context are:\n",
      "\n",
      "1. In-context learning: This involves providing a generalized model with a prompt, tools, and few-shot examples at inference time, allowing it to learn \"on the fly.\"\n",
      "\n",
      "2. Fine-tuning based learning: This method involves training a model using a larger dataset of specific examples prior to inference, helping the model understand when and how to apply certain tools.\n",
      "\n",
      "3. Using external memory: This includes examples like the 'Example Store' in Vertex AI extensions or data stores in RAG-based architecture, which help models access specific knowledge.\n",
      "--------------------------------\n",
      "inputs : What are the key functions of an agent's orchestration layer in achieving goals?\n",
      "outputs : The key functions of an agent's orchestration layer in achieving goals include structuring reasoning, planning, and decision-making, as well as guiding the agent's actions. It involves taking in information, performing internal reasoning, and generating informed decisions or responses. The orchestration layer can utilize various reasoning techniques such as ReAct, Chain-of-Thought, and Tree-of-Thoughts to facilitate these processes. Additionally, it can leverage language models and external tools to transition through states and execute complex tasks autonomously.\n",
      "--------------------------------\n",
      "inputs : List up the name of the authors\n",
      "outputs : The authors are Julia Wiesinger, Patrick Marlow, and Vladimir Vuskovic.\n",
      "--------------------------------\n",
      "inputs : What is Tree-of-thoughts?\n",
      "outputs : Tree-of-thoughts (ToT) is a prompt engineering framework that is well suited for exploration or strategic lookahead tasks. It generalizes over chain-of-thought prompting and allows the model to explore various thought chains that serve as intermediate steps for general problem solving with language models.\n",
      "--------------------------------\n",
      "inputs : What is the framework used for reasoning and planning in agent?\n",
      "outputs : The frameworks used for reasoning and planning in agents include ReAct, Chain-of-Thought, and Tree-of-Thoughts. These frameworks provide a structure for the orchestration layer to perform internal reasoning and generate informed decisions or responses.\n",
      "--------------------------------\n",
      "inputs : How do agents differ from standalone language models?\n",
      "outputs : Agents can use tools to access real-time data and perform actions, whereas models rely solely on their training data.\n",
      "Agents maintain session history for multi-turn reasoning, while models lack native session management.\n",
      "Agents incorporate cognitive architectures for advanced reasoning, such as ReAct or Chain-of-Thought.\n",
      "--------------------------------\n"
     ]
    }
   ],
   "source": [
    "from langsmith.client import Client\n",
    "\n",
    "client = Client()\n",
    "datasets = client.list_datasets()\n",
    "\n",
    "for dataset in datasets:\n",
    "    print(dataset.name)\n",
    "    for example in client.list_examples(dataset_name=dataset.name):\n",
    "        print(\"inputs :\", example.dict()[\"inputs\"][\"question\"])\n",
    "        print(\"outputs :\", example.dict()[\"outputs\"][\"answer\"])\n",
    "        print(\"--------------------------------\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View the evaluation results for experiment: 'GROUNDEDNESS_EVAL-929f91f7' at:\n",
      "https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=fd7bec12-fc0d-41d9-bf21-3db571a54f15\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "5837976c71f3427caf51bc16b2ec59df",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langsmith.evaluation import evaluate\n",
    "\n",
    "dataset_name = \"RAG_EVAL_DATASET\"\n",
    "\n",
    "# Run evaluation\n",
    "experiment_results = evaluate(\n",
    "    gpt_chain,\n",
    "    data=dataset_name,\n",
    "    evaluators=[upstage_groundedness_check_evaluator, custom_groundedness_check_evaluator],\n",
    "    experiment_prefix=\"GROUNDEDNESS_EVAL\",\n",
    "    metadata={\n",
    "        \"variant\": \"Hallucination evaluation using Upstage's and Custom Groundedness Checker\",\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![langsmith-groundedness-evaluation-01](./assets/11-groundedness-evaluation-01.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Comprehensive evaluation of dataset using summary evaluators\n",
    "\n",
    "This is useful when you want to evaluate the entire dataset. (The previous step was for individual data evaluation.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "from typing import List\n",
    "from langsmith.schemas import Example, Run\n",
    "\n",
    "\n",
    "def upstage_groundedness_check_summary_evaluator(\n",
    "    runs: List[Run], examples: List[Example]\n",
    ") -> dict:\n",
    "    def is_grounded(run: Run) -> bool:\n",
    "        context = run.outputs[\"context\"]\n",
    "        answer = run.outputs[\"answer\"]\n",
    "        return (\n",
    "            upstage_groundedness_check.invoke({\"context\": context, \"answer\": answer})\n",
    "            == \"grounded\"\n",
    "        )\n",
    "\n",
    "    groundedness_scores = sum(1 for run in runs if is_grounded(run))\n",
    "    return {\"key\": \"groundedness_score\", \"score\": groundedness_scores / len(runs)}\n",
    "\n",
    "\n",
    "def custom_groundedness_check_summary_evaluator(\n",
    "    runs: List[Run], examples: List[Example]\n",
    ") -> dict:\n",
    "    def is_grounded(run: Run) -> bool:\n",
    "        context = run.outputs[\"context\"]\n",
    "        answer = run.outputs[\"answer\"]\n",
    "        return (\n",
    "            custom_groundedness_check.invoke({\"context\": context, \"answer\": answer}).score\n",
    "            == \"yes\"\n",
    "        )\n",
    "\n",
    "    groundedness_scores = sum(1 for run in runs if is_grounded(run))\n",
    "    return {\"key\": \"groundedness_score\", \"score\": groundedness_scores / len(runs)}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View the evaluation results for experiment: 'GROUNDEDNESS_UPSTAGE_SUMMARY_EVAL-6d673c3a' at:\n",
      "https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=67a47286-2224-48d3-94d6-e2bb87a47aef\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "8e80160ea1eb4b75ba6f75df29ae9fdb",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "View the evaluation results for experiment: 'GROUNDEDNESS_CUSTOM_SUMMARY_EVAL-af449083' at:\n",
      "https://smith.langchain.com/o/9089d1d3-e786-4000-8468-66153f05444b/datasets/9b4ca107-33fe-4c71-bb7f-488272d895a3/compare?selectedSessions=72ae6cbd-e53b-4311-9c2a-654585da9c5b\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2953e903a235486d9072cdfcf5d73e62",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "0it [00:00, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from langsmith.evaluation import evaluate\n",
    "\n",
    "# Run evaluation\n",
    "experiment_result1 = evaluate(\n",
    "    gpt_chain,\n",
    "    data=dataset_name,\n",
    "    summary_evaluators=[\n",
    "        upstage_groundedness_check_summary_evaluator,\n",
    "    ],\n",
    "    experiment_prefix=\"GROUNDEDNESS_UPSTAGE_SUMMARY_EVAL\",\n",
    "    # Set experiment metadata\n",
    "    metadata={\n",
    "        \"variant\": \"Hallucination evaluation using Upstage Groundedness Checker\",\n",
    "    },\n",
    ")\n",
    "\n",
    "# Run evaluation\n",
    "experiment_result2 = evaluate(\n",
    "    gpt_chain,\n",
    "    data=dataset_name,\n",
    "    summary_evaluators=[\n",
    "        custom_groundedness_check_summary_evaluator,\n",
    "    ],\n",
    "    experiment_prefix=\"GROUNDEDNESS_CUSTOM_SUMMARY_EVAL\",\n",
    "    # Set experiment metadata\n",
    "    metadata={\n",
    "        \"variant\": \"Hallucination evaluation using Custom Groundedness Checker\",\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![langsmith-groundedness-evaluation-02](./assets/11-groundedness-evaluation-02.png)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
